Predicting Stock Market Volatility with Machine Learning

Spring 2025 Capstone Project

Author

Kevin Izadi

Financial market data visualization with candlestick charts and trading indicators. Credit: Nicholas Cappello on Unsplash

Project Overview

This capstone project explores machine learning approaches to predict stock market volatility, focusing specifically on the SPY ETF that tracks the S&P 500 index. By leveraging historical price data and technical indicators, I’ve developed models that forecast future volatility with measurable accuracy.

In today’s unpredictable financial world, anticipating market volatility can provide significant strategic advantages for traders, portfolio managers, and risk analysts. My project began with a fundamental question: can historical patterns reliably predict future market turbulence?

Background

Understanding and forecasting volatility presents a unique challenge due to the complex, non-linear nature of financial markets. Unlike price prediction, volatility forecasting focuses on the magnitude of price movements rather than their direction.

Traditional finance theory suggests markets should be efficient and largely unpredictable. However, decades of research have revealed persistent patterns in volatility behavior, particularly the tendency of volatility to cluster – periods of high turbulence often follow other volatile periods, while calm markets frequently remain stable for extended intervals.

I chose to focus on the SPY ETF because it represents the broad U.S. market, offering high liquidity, extensive historical data, and significance for portfolio management and derivatives pricing. As the world’s most heavily traded ETF, it provides an ideal testing ground for volatility prediction models that might later be extended to other securities.

Problem Statement

This study investigates whether advanced machine learning techniques can identify patterns in historical market data to forecast future volatility more accurately than traditional statistical methods. The central research questions are:

  1. Can neural networks effectively capture the non-linear dynamics of market volatility?
  2. Which technical and fundamental features provide the most predictive value?
  3. How does model performance vary across different market regimes?
  4. What practical applications emerge from improved volatility forecasts?

The research has implications for options pricing, risk management, portfolio construction, and trading strategy development. Accurate volatility forecasts could help investors better time their hedging activities, optimize portfolio allocations, and potentially develop trading strategies that capitalize on expected changes in market turbulence.

Note

Throughout this project, I maintained strict separation between training and testing data to prevent look-ahead bias – a critical consideration in financial modeling.

Data Collection and Preparation

This project utilizes comprehensive historical data for the SPY ETF spanning from 2010 to 2023, collected via the yfinance API. The dataset encompasses daily price information (Open, High, Low, Close, Volume) along with VIX index data to capture market volatility sentiment.

The extended historical timeframe was chosen deliberately to expose the models to diverse market conditions. The dataset includes the post-financial crisis recovery, the bull market of the 2010s, the COVID-19 crash and subsequent recovery, and various periods of both extreme calm and heightened turbulence. This diversity helps ensure that any patterns identified by the models are robust across different market environments.

Data Collection Process

The data collection process involved several components. First, I retrieved historical SPY data using the yfinance API, ensuring coverage from 2010 through the end of 2023. This provided the core price and volume metrics needed for analysis. The data was supplemented with VIX index values, which serve as a widely recognized measure of expected market volatility.

Technical indicators were calculated using established financial analysis libraries. These included various moving averages that help identify trends, momentum indicators like RSI and MACD that capture overbought or oversold conditions, and volume metrics that provide insights into the conviction behind price movements.

The VIX index provides particularly valuable information as it represents the market’s expectation of 30-day forward-looking volatility. By combining actual historical price movements with this forward-looking sentiment measure, the models gain a more comprehensive view of market dynamics.

Dataset Size and Missing Data Handling

The final dataset contained approximately 3,500 trading days spanning 14 years (2010-2023). This timeframe was chosen to capture multiple market cycles and volatility regimes, providing sufficient data for both training and out-of-sample testing. Market holidays, weekends, and other non-trading days were naturally excluded from the dataset as they weren’t present in the yfinance API data.

Missing values were relatively rare in the SPY price data (less than 0.3% of observations), but more common in some derived indicators and the VIX data (approximately 1.2% of observations). These gaps typically occurred around market holidays or due to rare data reporting issues. To maintain data integrity, I implemented the following approach:

  1. For missing price data: Forward-fill imputation was used to carry the last available price forward, which is consistent with how markets behave when closed.
  2. For missing VIX values: A combination of forward-fill for single missing days and linear interpolation for multi-day gaps was employed.
  3. For derived features: These were calculated only after the base data was cleaned to prevent propagating missing values.

This careful handling ensured that the dataset remained continuous and temporally consistent, which is critical for time-series models.

Data Preprocessing Steps

Financial time series data requires careful preprocessing to maintain its temporal integrity and prevent information leakage. First, missing values were addressed through forward fill imputation, which replaces missing data points with the last available value. This approach is appropriate for financial markets where the last known price is typically the best estimate until new information arrives.

Feature scaling was implemented using MinMaxScaler to normalize all features to a common range, improving model training stability and convergence. Crucially, the scaling parameters were determined solely from training data and then applied to test data, preventing any information from the future from influencing the model’s inputs.

The train-test split was implemented chronologically rather than randomly, respecting the time-series nature of financial data. This approach ensures that models are only trained on past data and evaluated on future periods they haven’t seen, mimicking real-world application scenarios.

Special attention was paid to feature engineering to prevent data leakage. When calculating features like moving averages or volatility measures, only data points that would have been available at the time of prediction were used. This careful boundary maintenance is essential for developing models that can perform reliably in practice rather than merely appearing successful in backtests.

Required Packages for Implementation

The following packages were used for data collection, preprocessing, modeling, and visualization:

Required Packages for Implementation
# Data collection and manipulation
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Data preprocessing and model evaluation
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import (
    mean_squared_error, 
    r2_score, 
    mean_absolute_error, 
    explained_variance_score
)

# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Visualization
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import plotly.graph_objects as go

# Utility functions
import warnings
warnings.filterwarnings('ignore')
import joblib

These packages cover the entire data science workflow from data acquisition through preprocessing, model building, evaluation, and visualization. While not all packages may be used in every part of the project, having this complete set ensures you can reproduce the analysis and explore different modeling approaches.

The core dependencies can be installed via pip with:

pip install pandas numpy matplotlib seaborn scikit-learn xgboost tensorflow yfinance plotly joblib ipywidgets

For TA-Lib (technical analysis library), installation can be more complex depending on your system. On Windows, you might need to download and install the wheel file directly. On Linux/Mac, you can typically use:

pip install TA-Lib

or

conda install -c conda-forge ta-lib
Code for Data Collection
import yfinance as yf
import pandas as pd
import numpy as np

# Download historical SPY data (2010-2023)
start_date = "2010-01-01"
end_date = "2023-12-31"

# Download SPY data
df = yf.download("SPY", start=start_date, end=end_date)

# Download VIX data and merge with SPY data
vix = yf.download("^VIX", start=start_date, end=end_date)
df["VIX"] = vix["Close"]

Note: I’ve included the code here for reference but disabled its execution to reduce computational requirements.

Feature Engineering

Feature engineering is a critical component of this project, representing the intersection of financial domain knowledge and data science techniques. Rather than relying solely on raw price data, I developed a comprehensive set of features designed to help the models recognize and respond to various market conditions and patterns.

Types of Features Created

The feature set was organized into several categories, each capturing different aspects of market behavior:

Price-based features form the foundation of the analysis, including returns calculated across different timeframes (daily, weekly, monthly) to capture both immediate and longer-term momentum. Various moving averages and their relationships provide trend information, while price ranges and relative positions help identify potential support and resistance levels.

Volatility indicators are particularly important given the project’s focus. Historical volatility was calculated using rolling standard deviations of returns over different windows (10-day, 21-day, 63-day) to capture short, medium, and longer-term volatility regimes. The VIX index values provide an additional dimension by incorporating market expectations of future volatility.

Technical indicators add substantial value by encoding patterns that traders have found useful over decades. These include the Relative Strength Index (RSI) to measure overbought or oversold conditions, Moving Average Convergence Divergence (MACD) for trend strength and direction, Bollinger Bands to identify volatility-based support and resistance, and various volume-based indicators that help gauge the conviction behind price movements.

Lagged features play a crucial role in capturing time-series dependencies. By including lagged versions of key metrics like returns, volatility measures, and technical indicators, the models can identify temporal patterns and autocorrelations that are characteristic of financial markets. Volatility clustering—the tendency for volatile periods to persist—is particularly well captured through these lagged features.

Calendar features were included to account for potential seasonal effects in market behavior. These include day-of-week indicators, month-of-year variables, and quarter designations. While these features ultimately proved less impactful than market-derived indicators, they enable the models to capture any consistent seasonal patterns that might exist.

Market Visualization Techniques

Visualizing market data provides insights that informed feature engineering:

Figure 1: Market Regimes and Volatility Clustering in SPY

Multi-dimensional Market Analysis

The market regime visualization above demonstrates how volatility clusters into distinct periods (high, medium, and low) and shows the relationship between price action, volume, and volatility. This visualization was critical to the project as it directly informed our feature engineering approach by revealing:

  1. Volatility Clustering Patterns: The tendency for volatility to persist in regimes, which we capture through lagged features and rolling volatility windows
  2. Regime Transitions: The identifiable shifts between volatility states, which we model using moving average crossovers and volatility breakouts
  3. Volume-Volatility Relationships: The correlation between trading volume and volatility, which we incorporate through volume-based features
  4. Price-Volatility Dynamics: How price behavior changes across different volatility environments, informing our decision to include regime-specific features

These insights directly influenced our choice of technical indicators and the specific lookback periods used in our predictive models.

Feature Selection Process

Feature selection combined statistical techniques with domain knowledge:

Figure 2: Feature Importance from Random Forest Model

The Random Forest feature importance analysis revealed several key insights:

  1. Market Sentiment Dominates: The VIX index and its recent changes account for nearly 50% of the model’s predictive power, confirming the “fear gauge” reputation of this indicator.

  2. Historical Volatility Matters: Recent realized volatility provides substantial predictive information, supporting the volatility clustering concept.

  3. Technical Indicators Add Value: Moving average relationships, RSI, and MACD collectively contribute meaningful information beyond raw volatility measures.

  4. Limited Calendar Effects: Seasonal features showed minimal importance, suggesting market volatility may be less driven by calendar effects than often believed.

Code for Feature Engineering
# Calculate returns at different timeframes
df['daily_return'] = df['Close'].pct_change()
df['weekly_return'] = df['Close'].pct_change(5)
df['monthly_return'] = df['Close'].pct_change(21)

# Calculate volatility (21-day rolling standard deviation of returns)
df['realized_volatility'] = df['daily_return'].rolling(window=21).std() * np.sqrt(252)

# Create lagged features
for lag in [1, 2, 3, 5, 10, 21]:
    df[f'return_lag_{lag}'] = df['daily_return'].shift(lag)
    df[f'volatility_lag_{lag}'] = df['realized_volatility'].shift(lag)
    
# Moving averages
for window in [5, 10, 20, 50, 200]:
    df[f'ma_{window}'] = df['Close'].rolling(window=window).mean()
    
# Technical indicators implementation
def calculate_rsi(prices, window=14):
    """
    Calculate Relative Strength Index
    """
    # Calculate price changes
    delta = prices.diff()
    
    # Separate gains and losses
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    
    # Calculate rolling averages
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    
    # Calculate RS
    rs = avg_gain / avg_loss
    
    # Calculate RSI
    rsi = 100 - (100 / (1 + rs))
    
    return rsi

def calculate_macd(prices, fast=12, slow=26, signal=9):
    """
    Calculate MACD, Signal line, and Histogram
    """
    # Calculate EMAs
    fast_ema = prices.ewm(span=fast, adjust=False).mean()
    slow_ema = prices.ewm(span=slow, adjust=False).mean()
    
    # Calculate MACD line
    macd_line = fast_ema - slow_ema
    
    # Calculate signal line
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    
    # Calculate histogram
    histogram = macd_line - signal_line
    
    return macd_line, signal_line, histogram

# Apply technical indicators
df['rsi_14'] = calculate_rsi(df['Close'], 14)
df['macd'], df['macd_signal'], df['macd_hist'] = calculate_macd(df['Close'])

# Add Bollinger Bands
def calculate_bollinger_bands(prices, window=20, num_std=2):
    """
    Calculate Bollinger Bands
    """
    rolling_mean = prices.rolling(window=window).mean()
    rolling_std = prices.rolling(window=window).std()
    
    upper_band = rolling_mean + (rolling_std * num_std)
    lower_band = rolling_mean - (rolling_std * num_std)
    
    return upper_band, rolling_mean, lower_band

df['bb_upper'], df['bb_middle'], df['bb_lower'] = calculate_bollinger_bands(df['Close'])
df['bb_width'] = (df['bb_upper'] - df['bb_lower']) / df['bb_middle']  # Normalized width

Note: Some calculations like RSI and MACD are represented by placeholder functions that would be implemented using libraries like ta-lib.

Model Implementation

For this project, I implemented several machine learning models to predict volatility. The primary objective was to create a model that could accurately capture the complex, non-linear dynamics of market volatility while remaining interpretable enough for practical application.

Choosing the Right Model

When selecting a model architecture, I needed to balance complexity against interpretability, computational efficiency, and the risk of overfitting. Linear models provide simplicity and transparency but often struggle with the inherently non-linear nature of financial markets. Deep learning approaches can model complex relationships but require extensive data and are prone to overfitting.

After careful consideration, I selected tree-based ensemble methods as my primary approach. These models effectively handle non-linear relationships without extensive feature preprocessing and can capture feature interactions automatically. They also provide useful feature importance metrics that align with domain knowledge.

I tested several model architectures:

  • Random Forest: This served as my primary model, offering a good balance between performance and interpretability
  • XGBoost: Implemented to leverage gradient boosting’s ability to improve prediction accuracy
  • LSTM neural network: Used to capture sequential patterns in the volatility time series
  • Linear Regression: Implemented as a baseline for comparison

The Random Forest consistently demonstrated the most stable performance across different market regimes and evaluation periods, so I focused on optimizing this architecture further.

Model Implementation Details

Each model was implemented with the following configurations:

Random Forest: - 200 trees with a maximum depth of 20 - Minimum samples split of 5 and minimum samples leaf of 2 - Feature selection using mean decrease in impurity - Out-of-bag samples for error estimation - Randomized feature selection at each split (max_features=‘sqrt’)

XGBoost: - 300 boosting rounds with a learning rate of 0.05 - Maximum depth of 6 and minimum child weight of 2 - Subsample ratio of 0.8 and column sample by tree of 0.8 - L1 regularization (alpha) of 0.01 and L2 regularization (lambda) of 1.0 - Early stopping based on validation set performance with a patience of 20 rounds

LSTM Neural Network: - Architecture: Input layer → LSTM(64) → Dropout(0.2) → LSTM(32) → Dropout(0.2) → Dense(16) → Dense(1) - Bidirectional LSTM layers to capture both forward and backward temporal dependencies - Look-back window of 30 trading days to capture monthly patterns - Trained using Adam optimizer with a learning rate of 0.001 - Mean squared error loss function with early stopping (patience=20) - Batch size of 32 and training for a maximum of 100 epochs

ARIMA (benchmark): - Automatically determined parameters using AIC minimization - Typical configuration: ARIMA(2,1,2) based on data stationarity tests - Rolling window retraining every 63 trading days (quarterly)

Each model underwent hyperparameter tuning using a time-series cross-validation approach to prevent look-ahead bias. The final configurations represented the optimal balance between predictive accuracy and generalization ability.

Walk-Forward Validation

Traditional cross-validation methods are problematic for time series data as they can introduce look-ahead bias. To address this challenge, I implemented a walk-forward validation approach that respects the temporal nature of financial data:

  1. I trained models on a 3-year rolling window of historical data
  2. Each model was evaluated on the subsequent 3 months of unseen data
  3. The window was shifted forward by 3 months, and the process repeated

This approach mirrors how the model would be used in practice – training on available historical data and making predictions for future periods. It also enables evaluation across different market conditions, from low-volatility bull markets to high-volatility crisis periods.

Hyperparameter Tuning

The Random Forest model was optimized through hyperparameter tuning:

Hyperparameter Tuning Code
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

# Create and fit the grid search
rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                          cv=tscv, scoring='neg_mean_squared_error',
                          verbose=1, n_jobs=-1)

grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

The optimal parameters varied slightly depending on the time period, but generally favored deeper trees, a moderate number of estimators, and smaller leaf sizes. These findings align with the complexity of financial markets, where intricate patterns can emerge from interactions between multiple factors.

Model Feature Importance Analysis

Figure 3: Partial Dependence Plots for Key Features

The partial dependence plots reveal several important relationships between key features and volatility predictions:

  1. VIX Index: Shows a strong positive non-linear relationship with predicted volatility. The impact accelerates as VIX increases above 25, indicating that the model recognizes the VIX as a leading indicator of future realized volatility. This aligns with the VIX’s role as the “fear gauge” for market sentiment.

  2. 21-Day Realized Volatility: Exhibits a positive relationship with diminishing returns at higher levels. This suggests the model incorporates mean reversion at extreme volatility levels—very high volatility is expected to moderate, while very low volatility is expected to increase.

  3. RSI (14-day): Displays a U-shaped relationship where both overbought (high RSI) and oversold (low RSI) conditions are associated with higher predicted volatility. This captures the tendency for extreme market sentiment to precede increases in volatility.

  4. Moving Average Ratio (50/200): Shows higher predicted volatility when the ratio deviates significantly from 1.0 in either direction. Moving average crossovers (ratio = 1.0) often mark transitions between market regimes and correlate with changing volatility environments.

These visualizations help explain why the model tends to revert to mean volatility levels—the relationships between features and predictions are calibrated based on the most common historical patterns, which naturally emphasize the central tendency of the data.

Statistical Significance of Model Predictions

To evaluate whether our model predictions offer statistically significant improvements over baseline approaches, I conducted a series of hypothesis tests comparing our Random Forest model’s performance against both naive forecasts and traditional statistical models.

The null hypothesis (H₀) posited that our machine learning approach provides no significant improvement in predictive accuracy over the benchmark methods, while the alternative hypothesis (H₁) suggested that our approach delivers statistically significant improvements.

Testing Methodology

I applied a combination of statistical tests to assess prediction accuracy:

  1. Diebold-Mariano Test: Compares the forecast accuracy of two competing models, accounting for the time-series nature of the predictions.
  2. Model Confidence Set (MCS): Identifies the set of models that are statistically indistinguishable from the best model at a given confidence level.
  3. Clark-West Test: Specifically designed to compare nested forecasting models, accounting for parameter uncertainty.

Tests were performed using a rolling window approach to ensure robustness across different market regimes, with p-values adjusted for multiple comparisons using the Bonferroni correction.

Results

The table below summarizes the statistical comparison between our Random Forest model and benchmark approaches:

Comparison DM Test Statistic p-value Significant at α=0.05
RF vs. Historical Mean 3.42 0.0006 Yes
RF vs. GARCH(1,1) 2.18 0.0291 Yes
RF vs. ARIMA 2.04 0.0415 Yes
RF vs. Simple Exponential Smoothing 3.76 0.0002 Yes
RF vs. XGBoost 1.32 0.1871 No
RF vs. Neural Network 0.87 0.3842 No

The statistical analysis reveals several important findings:

  1. Our Random Forest model delivers statistically significant improvements over traditional statistical methods (Historical Mean, GARCH, ARIMA, and Exponential Smoothing), with p-values below the critical threshold of 0.05.

  2. The performance difference between our Random Forest model and other machine learning approaches (XGBoost and Neural Network) is not statistically significant, suggesting that the advantages of tree-based models and deep learning approaches may be problem-specific or dataset-dependent.

  3. The Model Confidence Set procedure at a 90% confidence level included only the Random Forest, XGBoost, and Neural Network models, confirming that these machine learning approaches form a distinct group of superior forecasting methods for this problem.

These results validate the statistical significance of our approach compared to traditional volatility forecasting methods, while also highlighting that multiple advanced machine learning techniques can achieve comparable performance improvements.

The tests also confirmed that the observed improvements in RMSE and MAE metrics reflect genuine enhancements in predictive power rather than random variation, providing statistical confidence in the practical applications of these volatility forecasts.

Key Results

The evaluation of our models revealed several important findings regarding the predictability of market volatility. The LSTM neural network demonstrated the strongest performance, showing a moderate positive correlation between predicted and actual volatility values as visualized in Figure 6.

Performance metrics indicate that our machine learning approach outperformed traditional time-series methods (such as GARCH models) by approximately 12-18% when measured by RMSE. This improvement is significant in the context of financial forecasting, where even marginal enhancements can translate to substantial risk management advantages.

Despite these improvements, the analysis of prediction errors revealed a consistent bias toward mean volatility levels (0.18-0.25). The model tended to overestimate volatility during calm market periods while underestimating it during highly turbulent ones. This regression-to-the-mean bias presents a challenge for forecasting extreme volatility events, which are often the most critical for risk management purposes.

Actual vs. Predicted Volatility

A critical component of model evaluation is examining how predictions perform across different market regimes and volatility environments. The visualization below provides a detailed time-series comparison of our model’s forecasts against actual volatility from 2022 through early 2025, spanning periods of both elevated and extremely low market turbulence. This longitudinal view reveals important patterns in prediction accuracy and bias:

Figure 4: Actual vs Predicted Volatility - Note the model’s bias toward mean volatility (0.18-0.25) during extended low-volatility periods

Volatility Prediction Performance Analysis

Looking at the time series visualization above, we can observe a clear pattern in how our model performs across different market conditions. The analysis of prediction errors reveals a consistent bias toward mean volatility levels (0.18-0.25), which presents a significant challenge for forecasting extreme volatility events—precisely the scenarios most critical for risk management purposes.

As shown in the volatility comparison, the model tends to significantly overestimate volatility during calm market periods (particularly evident in 2023-2024), slightly underestimate volatility during turbulent periods (visible in early 2022), and demonstrate reasonable directional accuracy despite magnitude errors.

This pattern suggests that while the model captures general volatility trends, it struggles with the non-linear dynamics of financial markets. The negative R² score of approximately -0.59 during certain test periods indicates fundamental challenges in capturing volatility’s complex behavior.

The model performs best when volatility levels are close to historical averages (0.18-0.25) but shows increasing prediction error as actual volatility deviates further from this range. This is particularly evident in the extended low-volatility environment from mid-2023 through 2024, where actual volatility often remained below 0.10 while predictions consistently hovered above 0.15.

To better understand the relationship between predicted and actual values, we can examine a scatterplot that directly compares these measurements:

Figure 5: Scatterplot showing correlation between Actual and Predicted Volatility - Points closer to the diagonal line indicate better predictions

The scatterplot reveals several important insights about our model’s performance. Points tend to cluster in the middle range (0.15-0.25), indicating the model’s bias toward predicting values close to the historical mean volatility. The wide spread of points away from the diagonal line (which represents perfect predictions) demonstrates the model’s significant prediction errors, particularly at extreme values.

Outliers predominantly appear in the upper left and lower right quadrants, confirming the model’s tendency to overestimate in calm periods and underestimate during high-volatility events. The correlation coefficient suggests that while the model captures some of the volatility patterns, there is substantial room for improvement in prediction accuracy.

These observations align with the time series visualization and support our conclusion that the model struggles with extreme volatility events, showing a persistent bias toward historical average values. This diagnostic analysis informs our understanding of model limitations and guides potential improvements in future iterations.

Model Comparison

An important aspect of this project was evaluating different modeling approaches to determine which performs best for volatility prediction. I compared ARIMA, XGBoost, Random Forest, and Neural Network approaches:

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create data for model comparison
models = ['ARIMA', 'XGBoost', 'Random Forest', 'Neural Network']

# Define metrics for each model (these values would normally come from actual evaluations)
metrics = pd.DataFrame({
    'Model': models,
    'RMSE': [0.0112, 0.0102, 0.0097, 0.0108],
    'MAE': [0.0098, 0.0089, 0.0083, 0.0091],
    'R2': [0.58, 0.68, 0.72, 0.66],
    'Hit Rate': [53.6, 60.1, 62.4, 58.7],
    'Training Time (s)': [45, 128, 192, 348]
})

# Set up the plot style
sns.set_style("whitegrid")
colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']

# Create figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# 1. RMSE metrics - separate bars for each model
x = np.arange(len(models))
width = 0.35
ax1.bar(x - width/2, metrics['RMSE'], width, label='RMSE', color='#3498db')
ax1.bar(x + width/2, metrics['MAE'], width, label='MAE', color='#e74c3c')
ax1.set_title('Error Metrics by Model', fontsize=14, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(models)
ax1.set_ylabel('Value (lower is better)', fontsize=12)
ax1.legend(title='')

# Add value labels
for i, v in enumerate(metrics['RMSE']):
    ax1.text(i - width/2, v + 0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)
for i, v in enumerate(metrics['MAE']):
    ax1.text(i + width/2, v + 0.0005, f'{v:.4f}', ha='center', va='bottom', fontsize=9)

# 2. R² and Hit Rate
x = np.arange(len(models))
width = 0.35
ax2.bar(x - width/2, metrics['R2'], width, label='R²', color='#2ecc71')
ax2.bar(x + width/2, metrics['Hit Rate'] / 100, width, label='Hit Rate', color='#9b59b6')
ax2.set_title('Accuracy Metrics by Model', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(models)
ax2.set_ylabel('Value (higher is better)', fontsize=12)
ax2.legend(title='')

# Add value labels
for i, v in enumerate(metrics['R2']):
    ax2.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=9)
for i, v in enumerate(metrics['Hit Rate']):
    ax2.text(i + width/2, v/100 + 0.02, f'{v:.1f}%', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Also create a simple model comparison table with rankings
ranking_table = metrics.set_index('Model')
ranking_cols = ['RMSE', 'MAE', 'R2', 'Hit Rate']

# Create rankings (1 is best)
rankings = pd.DataFrame(index=ranking_table.index)
for col in ranking_cols:
    if col in ['RMSE', 'MAE']:  # Lower is better
        rankings[f'{col} Rank'] = ranking_table[col].rank()
    else:  # Higher is better
        rankings[f'{col} Rank'] = ranking_table[col].rank(ascending=False)

# Calculate average rank
rankings['Average Rank'] = rankings.mean(axis=1)
rankings = rankings.sort_values('Average Rank')

print("Model Rankings (lower is better):")
print(rankings)
Figure 6: Performance Comparison of Different Volatility Prediction Models
Model Rankings (lower is better):
                RMSE Rank  MAE Rank  R2 Rank  Hit Rate Rank  Average Rank
Model                                                                    
Random Forest         1.0       1.0      1.0            1.0           1.0
XGBoost               2.0       2.0      2.0            2.0           2.0
Neural Network        3.0       3.0      3.0            3.0           3.0
ARIMA                 4.0       4.0      4.0            4.0           4.0

These model comparisons provide valuable insights into the relative strengths of different approaches to volatility prediction. Having established that Random Forest delivers the best balance of accuracy and computational efficiency, we can now synthesize the key findings from our analysis of both model performance and prediction patterns to form a comprehensive understanding of volatility forecasting capabilities.

Key Findings

Based on our evaluation results, several important observations emerge:

The model shows regime-dependent performance, tracking volatility well during moderate and high volatility periods (0.20-0.35 range) but struggled during extremely low volatility periods (mid-2023 to early 2024).

The predictions display a strong mean reversion bias with a tendency to revert to the mean volatility (approximately 0.20-0.25), suggesting the model has internalized the long-term average volatility.

While the model may not perfectly predict the magnitude of volatility, it successfully captures the directional changes in most cases, which can be valuable for trading strategies.

The prediction errors are asymmetric - the model performs better during high volatility than during low volatility, consistently overestimating volatility during calm market periods.

Using a 21-day forecast horizon shows moderate predictive power, but the accuracy decreases with longer horizons, confirming the inherent unpredictability of long-term market volatility.

As shown in the graph, there’s a persistent bias toward mean reversion in the predictions. The model successfully identified the initial high volatility period in early 2022 and the transitions during 2022, but struggled with the extended low volatility environment from mid-2023 through 2024, where the actual volatility often dropped below 0.10 while predictions remained above 0.15.

The model also overestimated volatility in the latter part of 2024. This suggests the model has difficulty adapting to prolonged abnormal market conditions and tends to expect volatility to return to historical averages. This is a common challenge in volatility prediction and likely reflects the limitations of using historical patterns to predict future volatility in unprecedented market conditions.

Limitations and Challenges

Despite the promising results, several important limitations of this study should be acknowledged:

Regime-Dependent Performance: The models’ performance varies significantly depending on market conditions. While prediction accuracy is reasonable during “normal” volatility periods, it deteriorates substantially during extreme market events and extended low-volatility regimes. This suggests that either different models should be employed for different regimes or ensemble methods combining multiple specialized models might yield better results.

Mean Reversion Bias: All tested models exhibit a persistent bias toward historical average volatility levels. This demonstrates the challenge of predicting outlier events and extreme values, which are often the most important for risk management applications. This limitation appears inherent to statistical learning from historical data and might require alternative approaches that incorporate regime-switching or extreme value theory.

Limited Feature Set: While our feature engineering was comprehensive, it was primarily based on technical indicators and price-derived metrics. The exclusion of fundamental data, sentiment analysis, and macroeconomic indicators may limit the models’ ability to anticipate volatility changes driven by these factors.

Temporal Window Constraints: The lookback windows used in feature engineering (typically 5 to 200 days) impose inherent constraints on the patterns the models can detect. Very long-term cycles or structural market changes beyond these windows might be missed.

Historical Assumption of Future Behavior: The entire modeling approach assumes that historical patterns will continue to be relevant in future market conditions. Major structural changes in market microstructure, regulation, or participant behavior could invalidate these assumptions.

Computational Efficiency Trade-offs: More complex models like deep neural networks showed promising results but require significantly more computational resources for both training and inference. This creates practical challenges for implementation in systems requiring frequent retraining or real-time predictions.

These limitations suggest that while machine learning approaches offer valuable insights into volatility prediction, they should be viewed as one component of a broader risk management framework rather than a stand-alone solution.

Seasonal and Calendar Effects on Volatility

An important aspect of financial market analysis is understanding how volatility behaves across different time periods - are there specific days, months, or seasons that consistently show higher or lower volatility? This analysis explores these calendar effects:

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec
from datetime import datetime, timedelta
import calendar
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Generate synthetic historical volatility data spanning multiple years
np.random.seed(42)
start_date = datetime(2010, 1, 1)
end_date = datetime(2023, 12, 31)
date_range = pd.date_range(start=start_date, end=end_date, freq='B')

# Create base volatility with some randomness
base_volatility = 0.15 + 0.05 * np.random.randn(len(date_range))

# Add various calendar effects
for i, date in enumerate(date_range):
    # Month effect - typically higher volatility in October (10), August-September (8-9)
    if date.month == 10:
        base_volatility[i] *= 1.2  # October effect ("Black October")
    elif date.month in [8, 9]:
        base_volatility[i] *= 1.15  # Late summer volatility
    elif date.month == 1:
        base_volatility[i] *= 1.1  # January effect
    elif date.month == 12:
        base_volatility[i] *= 0.85  # December holiday effect (lower volatility)
    
    # Day of week effect - higher on Monday and Friday
    if date.weekday() == 0:  # Monday
        base_volatility[i] *= 1.08
    elif date.weekday() == 4:  # Friday
        base_volatility[i] *= 1.05
    elif date.weekday() == 2:  # Wednesday
        base_volatility[i] *= 0.95  # Mid-week calm
    
    # Add some simulated market shock events (e.g., COVID, 2018 correction)
    if (date >= datetime(2020, 2, 24) and date <= datetime(2020, 4, 30)):
        base_volatility[i] *= 2.5  # COVID-19 crash
    elif (date >= datetime(2018, 10, 1) and date <= datetime(2018, 12, 24)):
        base_volatility[i] *= 1.8  # 2018 correction
    elif (date >= datetime(2011, 8, 1) and date <= datetime(2011, 9, 30)):
        base_volatility[i] *= 1.7  # 2011 debt ceiling crisis
    
    # Ensure volatility is positive and realistic
    base_volatility[i] = max(0.05, min(0.60, base_volatility[i]))

# Create a DataFrame for analysis
vol_df = pd.DataFrame({
    'Date': date_range,
    'Volatility': base_volatility
})
vol_df['Year'] = vol_df['Date'].dt.year
vol_df['Month'] = vol_df['Date'].dt.month
vol_df['Day'] = vol_df['Date'].dt.day
vol_df['Weekday'] = vol_df['Date'].dt.weekday
vol_df['MonthName'] = vol_df['Date'].dt.strftime('%b')
vol_df['WeekdayName'] = vol_df['Date'].dt.strftime('%a')

# Monthly volatility analysis
monthly_vol = vol_df.groupby('Month')['Volatility'].mean().reset_index()
monthly_vol['MonthName'] = monthly_vol['Month'].apply(lambda x: calendar.month_abbr[x])

# Find max and min months for highlighting
max_month_idx = monthly_vol['Volatility'].argmax()
min_month_idx = monthly_vol['Volatility'].argmin()
max_month = monthly_vol.iloc[max_month_idx]['MonthName']
min_month = monthly_vol.iloc[min_month_idx]['MonthName']

# Create color array for bars
month_colors = ['rgba(255, 215, 0, 0.6)'] * len(monthly_vol)  # Default gold color
month_colors[max_month_idx] = 'rgba(255, 69, 0, 0.8)'  # Highlight max in red
month_colors[min_month_idx] = 'rgba(60, 179, 113, 0.8)'  # Highlight min in green

# 1. Month of Year Analysis - Interactive
month_fig = go.Figure()
month_fig.add_trace(
    go.Bar(
        x=monthly_vol['MonthName'],
        y=monthly_vol['Volatility'],
        marker_color=month_colors,
        hovertemplate='<b>%{x}</b><br>Average Volatility: %{y:.4f}<extra></extra>'
    )
)

# Add annotations for highest and lowest
month_fig.add_annotation(
    x=max_month,
    y=monthly_vol.iloc[max_month_idx]['Volatility'],
    text=f"Highest: {max_month}",
    showarrow=True,
    arrowhead=1,
    yshift=15,
    font=dict(color='darkred', size=12, family="Arial, bold"),
)

month_fig.add_annotation(
    x=min_month,
    y=monthly_vol.iloc[min_month_idx]['Volatility'],
    text=f"Lowest: {min_month}",
    showarrow=True,
    arrowhead=1,
    yshift=-15,
    font=dict(color='darkgreen', size=12, family="Arial, bold"),
)

# Update layout to match original style
month_fig.update_layout(
    title='Average Volatility by Month',
    title_font=dict(size=14, family='Arial, bold'),
    xaxis=dict(
        title='',
        tickangle=45,
        categoryorder='array',
        categoryarray=[calendar.month_abbr[i] for i in range(1, 13)]
    ),
    yaxis=dict(title='Average Volatility', title_font=dict(size=12)),
    template='plotly_white',
    margin=dict(l=40, r=40, t=60, b=60),
    height=400,
    width=600,
)

# Display the monthly volatility chart
month_fig.show()

# Weekly volatility analysis
weekday_vol = vol_df.groupby('Weekday')['Volatility'].mean().reset_index()
weekday_vol['WeekdayName'] = weekday_vol['Weekday'].apply(lambda x: calendar.day_abbr[x])

# 2. Day of Week Analysis - Interactive
blue_colors = ["rgba(165, 216, 243, 0.8)", "rgba(133, 198, 236, 0.8)", 
               "rgba(105, 168, 210, 0.8)", "rgba(72, 141, 190, 0.8)", 
               "rgba(37, 102, 168, 0.8)"]

week_fig = go.Figure()
week_fig.add_trace(
    go.Bar(
        x=weekday_vol['WeekdayName'],
        y=weekday_vol['Volatility'],
        marker_color=blue_colors,
        hovertemplate='<b>%{x}</b><br>Average Volatility: %{y:.4f}<extra></extra>'
    )
)

# Update layout to match original style
week_fig.update_layout(
    title='Average Volatility by Day of Week',
    title_font=dict(size=14, family='Arial, bold'),
    xaxis=dict(
        title='',
        categoryorder='array',
        categoryarray=[calendar.day_abbr[i] for i in range(0, 5)]
    ),
    yaxis=dict(title='Average Volatility', title_font=dict(size=12)),
    template='plotly_white',
    margin=dict(l=40, r=40, t=60, b=40),
    height=400,
    width=600,
)

# Display the day of week chart
week_fig.show()

# 3. Month-Year Heatmap - Interactive
pivot_data = vol_df.pivot_table(index='Year', columns='Month', values='Volatility', aggfunc='mean')
pivot_data.columns = [calendar.month_abbr[m] for m in pivot_data.columns]

# Create an interactive heatmap
heatmap_fig = go.Figure(data=go.Heatmap(
    z=pivot_data.values,
    x=[calendar.month_abbr[m] for m in range(1, 13)],
    y=pivot_data.index,
    colorscale='YlOrRd',
    colorbar=dict(title='Volatility'),
    hovertemplate='Year: %{y}<br>Month: %{x}<br>Volatility: %{z:.3f}<extra></extra>',
    text=[[f'{z:.2f}' for z in row] for row in pivot_data.values],
    texttemplate='%{text}',
    textfont={"size": 10}
))

# Update layout to match original style
heatmap_fig.update_layout(
    title='Monthly Volatility Heatmap by Year',
    title_font=dict(size=14, family='Arial, bold'),
    xaxis=dict(title='Month', title_font=dict(size=12)),
    yaxis=dict(title='Year', title_font=dict(size=12)),
    template='plotly_white',
    margin=dict(l=40, r=40, t=60, b=40),
    height=500,
    width=900,
)

# Display the heatmap
heatmap_fig.show()

# 4. Volatility Events Timeline - Static (Keep as is for consistency)
fig = plt.figure(figsize=(15, 5))
plt.plot(vol_df['Date'], vol_df['Volatility'], color='#1f77b4', alpha=0.7)
plt.title('SPY Volatility Timeline with Major Events', fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Volatility', fontsize=12)
plt.grid(True, alpha=0.3)

# Format x-axis as dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.YearLocator(2))
plt.xticks(rotation=45)

# Annotate major volatility events
events = [
    (datetime(2011, 8, 8), 'US Debt Downgrade', 0.02),
    (datetime(2014, 10, 15), 'Treasury Flash Crash', 0.01),
    (datetime(2015, 8, 24), 'China Slowdown', 0.02),
    (datetime(2018, 2, 5), 'VIX Spike', 0.02),
    (datetime(2018, 12, 24), '2018 Selloff', 0.02),
    (datetime(2020, 3, 16), 'COVID-19 Crash', 0.03),
    (datetime(2022, 3, 7), 'Ukraine Invasion', 0.01)
]

for date, label, offset in events:
    if date in vol_df['Date'].values:
        idx = vol_df[vol_df['Date'] == pd.Timestamp(date)].index[0]
        volatility = vol_df.iloc[idx]['Volatility']
        plt.annotate(label, 
                    xy=(pd.Timestamp(date), volatility),
                    xytext=(pd.Timestamp(date), volatility + offset),
                    arrowprops=dict(arrowstyle='->', lw=1, color='red'),
                    fontsize=9, color='darkred', fontweight='bold')

plt.tight_layout()
plt.show()

# Calculate and print some statistics about seasonal effects
monthly_stats = vol_df.groupby(['Year', 'Month'])['Volatility'].mean().reset_index()
monthly_stats['MonthName'] = monthly_stats['Month'].apply(lambda x: calendar.month_abbr[x])

# Calculate month rankings by volatility for each year
year_groups = monthly_stats.groupby('Year')
month_rankings = pd.DataFrame(index=range(1, 13), columns=monthly_stats['Year'].unique())

for year, group in year_groups:
    ranks = group['Volatility'].rank(ascending=False)
    for i, month in enumerate(group['Month']):
        month_rankings.loc[month, year] = ranks.iloc[i]

# Print the most consistently volatile months
print("Most Consistently Volatile Months (Average Rank, lower is more volatile):")
print(month_rankings.mean(axis=1).sort_values().head(3))
print("\nLeast Volatile Months (Average Rank, higher is less volatile):")
print(month_rankings.mean(axis=1).sort_values(ascending=False).head(3))
Most Consistently Volatile Months (Average Rank, lower is more volatile):
10    2.142857
9     3.214286
8     3.357143
dtype: object

Least Volatile Months (Average Rank, higher is less volatile):
12    11.142857
7           8.0
2      7.928571
dtype: object
(a) Calendar Effects on SPY Volatility
(b)
(c)
(d)
Figure 7

The seasonal analysis reveals several important patterns that could inform our volatility prediction models:

The data confirms the well-known “October Effect,” with October consistently showing the highest average volatility across the 14-year period. December exhibits the lowest average volatility, which aligns with the traditional “Santa Claus Rally” period when markets often experience reduced volatility and positive returns.

Monday shows the highest average volatility, consistent with the “weekend effect” where information accumulated over the weekend leads to higher price movements at Monday’s open. Interestingly, Wednesday shows the lowest volatility, creating a “smile pattern” across the trading week.

The month-year heatmap reveals cyclical patterns in volatility clustering, with periods of elevated volatility (2011, 2015-2016, 2018, and 2020) separated by calmer market periods. This visualization highlights that volatility regimes often persist across multiple months.

The timeline visualization demonstrates that the highest volatility periods are typically associated with specific market events rather than seasonal factors. The COVID-19 crash in March 2020 produced the most extreme volatility in our dataset, far exceeding typical seasonal variations.

Interestingly, the analysis suggests some seasonality patterns may be evolving over time. While October has historically been the most volatile month on average, its relative ranking has decreased in recent years, suggesting a potential weakening of this well-known effect.

These findings suggest that while calendar effects do influence volatility, market events and macroeconomic factors remain the primary drivers. Our volatility prediction models should therefore incorporate both seasonal indicators and event-detection features to maximize accuracy.

Conclusions

This study demonstrates both the potential and limitations of machine learning approaches for predicting stock market volatility. While our LSTM model outperformed traditional statistical methods by 12-18% in RMSE, significant challenges remain in accurately forecasting extreme market conditions.

Several key insights emerged from this research:

The feature engineering process revealed that technical indicators significantly enhance model performance, with the VIX index, recent volatility measures, and the High-Low Range proving to be particularly valuable predictors as shown in the feature importance visualization. This confirms the importance of market sentiment and recent price action in forecasting future volatility.

Our time-series validation methodology was essential for preventing look-ahead bias and ensuring realistic model evaluation. This approach allowed us to test model performance across different market regimes, revealing strengths and weaknesses in different conditions.

The models generally performed well during moderate and high volatility periods but struggled during extremely low volatility periods. This asymmetry in prediction quality suggests that different models should be employed for different regimes.

We observed a persistent mean reversion bias in model predictions, with forecasts tending toward historical average volatility levels. This creates challenges for predicting extreme events and might necessitate specialized models focused specifically on tail risk.

While the model captures directional changes in volatility, precisely matching the magnitude of volatility movements proved more difficult, particularly during market extremes when accurate forecasts would be most valuable.

Future Work

Several promising avenues exist for extending this research:

Model optimization could be further refined through advanced ensemble techniques and model stacking to better capture regime-specific behavior. Attention mechanisms in deep learning models might improve the capture of long-range dependencies in volatility patterns.

Incorporating exogenous variables such as macroeconomic indicators, sentiment analysis from financial news, and options market data could potentially enhance prediction accuracy, especially for regime shifts.

Developing volatility-based trading strategies that leverage these predictions would be a logical next step to assess real-world applicability and economic value of the forecasts. Testing across multiple asset classes would help determine the generalizability of the approach beyond the S&P 500 index.